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Abstract 

Gaussian process is a very promising novel technology that has been applied to both 
the regression problem and the classification problem. While for the regression problem 
it yields simple exact solutions, this is not the case for the classification problem, 
because we encounter intractable integrals. In this paper we develop a new derivation 
that transforms the problem into that of evaluating the ratio of multivariate Gaussian 
orthant integrals. Moreover, we develop a new Monte Carlo procedure that evaluates 
these integrals. It is based on some aspects of bootstrap sampling and acceptance- 
rejection. The proposed approach has beneficial properties compared to the existing 
Markov Chain Monte Carlo approach, such as simplicity, reliability, and speed. 



1 Introduction 



Gaussian process classification (GPC) is a promising Bayesian approach to the classification 
problem. It is based on defining a so-called latent variable for every pattern, and setting 
the prior based on proximity relations among the patterns. Based on the prior and through 
a series of integrals, the posterior of a test pattern will determine its classification (see the 
reviews of Rasmussen and Williams [36], Nickisch and Rasmussen [31], and Seeger [IQ]). 
Unfortunately, this resulting multi-integral formula is intractable, and can only be solved 
through some approximations or using lengthy algorithms. In spite of its superior perfor- 
mance (as has been pointed out in several application studies, such as by Altun, Hofmann, 
and Smola [2], Jenssen et al [23], and Bazi and Melgani [1]), this computational issue ham- 
pers its wide applicability to large data sets. In this paper, we propose a new simplified, 
but exact, formula for the binary GPC problem. The proposed method is based on applying 
some substitutions and transformations that convert the problem into that of evaluating a 
ratio of two orthant integrals of a multivariate Gaussian density. Moreover, we develop a new 
Monte- Carlo- type algorithm to evaluate these orthant integrals. The proposed algorithm is 
based on some aspects of bootstrap sampling and acceptance-rejection. 

The approaches in the literature for the GPC problem can be categorized into solu- 
tions based on analytical approximation of the integrals and solutions based on Monte 
Carlo sampling (see the extensive review of Nickisch and Rasmussen [31] ) . Among the well- 
known proposed methods from the first category are the Laplace's approximation (Williams 
and Barber [H]) and the expectation propagation (Minka [30]). Also, some other efficient 
approximation-based methods include the work of Csato et al [B], Opper and Winther [35]). 
Gibbs and MacKay Rifkin and Klautau |EH], Jaakkola and Haussler [22j, and Kim, 
and Ghahramani [26]. Work on the second category has been more scarce. Almost all of 
the approaches are based on the Markov Chain Monte Carlo (MCMC) concept, such as the 
approach by Neal [33]. Neal's so-called Hybrid Monte Carlo (HMC) is based on the concept 
of "importance density". Murray, Adams, and Mackay [32] proposed the Elliptical Slice 
Sampler (ESS). It is based on sampling over an elliptical shce to efficiently obtain the step 
size. Titsias, Lawrence and Rattrpay [17] introduced a novel MCMC approach by making 
use of a low dimensional vector of control variables (see also Titsias and Lawrence [18]). 
Vehtari, Sarkka, and Lampinen [51] proposed a novel way to choose effective starting values 
for the MCMC based on early stopping. Barber and Williams [3] proposed a hybrid method, 
that is uses partly an approximation and partly the MCMC procedure. 

Nickisch and Rasmussen [31] ) provided a comprehensive review and comparison between 
the different GPC approximations and the MCMC approaches. They came to the conclusion 
that the MCMC-type approaches are superior in performance, but of course computationally 
more extensive. This is because they can obtain the exact solution of the integral formula, 
provided the size of the run is large enough. 

Kuss and Rasmussen [2S] compared between a number of GPC approximations. Rather 
than improving the approximation ability or the computation speed, some people considered 
short-cuts to the the GPC model itself to achieve better efficiency or performance, such as 
sparse GPC models (see Csato and Opper [7] Vanhatalo and Vehtari [50], and Titsias and 
Lawrence ^B]), and low-dimensional manifold embedding (see Ursatun and Darrell [33] ) . 

A majority of the work in the literature and the above reviewed methods consider the 
binary classification case. Nevertheless, some researchers extended The GPC problem to the 
multi-class case (for example Girolami and Rogers [16], and Seeger and Jordan [2]). 



The method proposed in this paper fits more into the second category described above 
(i.e. exact Monte Carlo-based), but it is not based on the MCMC concept. It is guaranteed 
to converge as close as possible to the exact solution of the multi-integral formula, provided 
we use a large enough sample of generated points. The advantages of the proposed algo- 
rithm is that it does not require any parameter-tuning (other than specifying the number of 
Monte Carlo generated points), is consistent, and reliable. (In short it works all the time, 
we tested hundreds of problems, some as high as 2000 dimensional problems.) It also com- 
pares favorably in terms of speed and accuracy to the other MCMC approaches, especially 
for the evaluation of the marginal likelihood. The marginal likelihood is an expression for 
the likelihood of the data given the parameters. The hyperparameters are typically tuned 
by optimizing the marginal likelihood function. Moreover, in contrast with several MCMC 
approaches, the accuracy for the proposed algorithm is similar for most possible hyperpa- 
rameter combinations of the covariance matrix, thus allowing easy hyperparameter tuning. 
A beneficial aspect of the proposed integral formulation is that it gives many insights into the 
different influencing factors. For example, one can obtain the limiting behavior of the covari- 
ance matrix parameters, and therefore understand the classification behavior when moving 
their values in certain directions. Also, the other advantage of the integral formulation is 
that it is given in terms of multivariate Gaussian orthant integrals. These are well-researched 
integrals, and several approximations exist in the literature. So, this could possibly open the 
way for new competitive approximations to the GPC problem. 

The paper is organized as follows. Next section we present an overview and definition 
of the GPC problem. In Section 3 we propose the new formulation that is obtained by 
simplifying the multi-integral formula. Section 4 presents an overview of the multivariate 
Gaussian integral that has to be evaluated. In Section 5 we propose the new Monte Carlo 
model for evaluating the integral. Section 6 provides the experimental study to assess the 
new model, and Section 7 is the conclusion. 

2 The Gaussian Process Classification Problem 

Gaussian process classifiers (GPC) are based on defining a "latent state" fi for every training 
pattern. It is a central variable in the formulation which measures some sort of degree of 
membership to one of the classes. Let yi denote the class membership of training pattern i, 
where Hi = I denotes Class 1 and Ui = —1 denotes Class 2. The latent variable /j, whose 
range is from — oo to oo, is mapped into class posterior probability through a monotone 
squashing function a that has a range of (0, 1), as follows. 



There are two typical forms for a in the GPC literature: the logit (or logistic function) and 
the probit (or cumulative Gaussian integral). As argued by Nickisch and Rasmussen [31], 
both choices are effectively quite similar. In this work we consider only the probit function. 

In what is next we will follow closely the terminology of Rasmussen and Williams [36] . 
Let us arrange the latent variables and the class memberships in one vector each: / = 
(/i! • • • ) In)'^^ and y = {yi, . . . , j/at)"^. Note that each index of the afforementioned vectors 
pertains to a specific training pattern, and is the size of the training set. Let Xi be the 
feature vector of training pattern i. Moreover, let us arrange all training vectors Xi as rows 




(1) 



in a matrix X. Let x* denote the feature vector of the test pattern, whose class needs to be 
evaluated. Let its latent state be /=„. 

The latent state vector / obeys an a priori density that is assumed to be a multivariate 
Gaussian (therefore the name Gaussian processes). From an a priori point of view, patterns 
that are close (in the features space) are more likely to belong to the same class. So this 
prior density is selected to reflect that property. Patterns with nearby feature vectors have 
highly correlated latent variables fi, and as the patterns become more distant the correlation 
decays. The a priori density can be written as 

p(/|X)=Ar(/,0,S) (2) 

where Af{f, /i, S) denotes a Gaussian density of variable / having mean vector /i and co- 
variance matrix E. The covariance matrix has elements that are a function of the distance 
between two feature vectors — and is so designed to achieve this aforementioned cor- 
relation behavior (see Rasmussen and Williams for a detailed discussion and examples 
of covariance functions). A particularly prevalent choice of the covariance matrix is the so 
called "RBF" covariance matrix, given by: 



(3e 



(3) 



The a and fi parameters (called respectively the length scale and the latent function scale) 
are very influential in the performance of the classifier, and tuning them has to be done with 
care (see Sundarajan and Keerthi |15]). More will be said later on them. 

A test pattern's latent variable will have similar correlation structure as the training 
patterns. Consider the augmented training latent state vector and test point latent state. It 
is given as Gaussian, as follows: 
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(4) 



where T^xx* is the covariance of x^, and X, and S^...^, is the variance of A key to estimating 
the class membership of the test point is to evaluate the probability density of its latent state 
conditional on all the information that is available from the training set: 



(5) 



p{fAX,y,x^) = j p{f^\X,x^J)p{f\X,y)df 

where p{f\X,y) = ^^^pj^p/j^'' is the posterior of the latent variables. 

Then, we compute the the probability of Class 1 averaged over the conditional density 
of /*: 

J* = P{y* = +MX, y, x^) = J a{f^)p{f^\X, y, x^)df^ (6) 
where we used the fact that cr{f^:) signifies the conditional given in Eq. ([T]). We get 

Jp{MX,x,J)p{y\f)p{f\X)df 



Pif*\X,y,x^) 



Piy\x) 



(7) 



where 



N N N — — 

Pivlf) = I[piy^\f^) = EcTiyJ,) = n / ^dx, (8) 
i=i i=i i=i_-^^ V^vr 

where we used the fact that a is the probit function (integral of the Gaussian function). 
Note that P{yi = -l\fi) = 1 - P{yi = l\fi) = 1 - a(/j) = cr{-fi) = a{yifi) because of the 
point symmetry of a. Also, 

p{U\X,x,J)=U{U,Jf,al) (9) 

where 

a = S-^Sxx, (10) 



C"* — ^x.x* — ^Xx.^ ^^Xx, (11) 

We utilized formulas expressing the conditional of a multidimensional Gaussian distribution, 
applied to Eq. 

All past formulas follow from straightforward probability manipulations, and they are 
described clearly in Rasmussen and Wilhams [36], p. 16 Eq. 2.19. Equation (|6]), representing 
the posterior probability of the test pattern x*, is the main quantity needed to classify the 
pattern. Thus, J^, > 0.5 means that the pattern should be classified as Class 1, and otherwise 
it should be classified as Class 2. 

Another important quantity that is needed is the so-called marginal likelihood L, defined 

as 



L=p{y\X) (12) 

It is an important quantity for the purpose of tuning the two hyperparameters [a and /3). By 
maximizing the marginal likelihood, we arrive at hyperparameters that are most consistent 
with the observed data. As such, any method should also be able to efficiently evaluate 
the marginal likelihood. See Rasmussen and Williams [36] for more information about the 
marginal likelihood. 



3 The Proposed Simplification of the Multi-Integral 

3.1 Variable Transformation 

Evaluating Equations (Q, ([7]) analytically is very hard to accomplish. The difficulty arises 
also when attempting to evaluate them numerically because of the high dimensionality of 
the integrals (for example for a problem with a training set of size 1000 we are dealing with 
more than a thousand-fold integral). Also, attempting a standard Monte Carlo evaluation 
leads to some practical problems, among them is the fact that Jlili <^{yifi) turns out to be 
usually a very small number (with a negative exponent with a very large magnitude). So 
to summarize, we are dealing with a very hard problem if an exact solution is to be sought. 
Here we develop a procedure that transforms the problem into the more approachable form of 
integrals of multivariate Gaussian functions. Specifically, we perform some substitutions and 
transformations of variables that will convert the problem into evaluating orthant integrals of 
some multivariate Gaussian density. By orthant integral we mean an integral of a zero-mean 



multivariate Gaussian function over some quadrant, e.g. over x > 0. The detailed steps are 
given below. 

Substituting ([HD, (0), (ED and ([7]) into ©, we obtain: 
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Af{f, 0, S) Ar(/„ a^/, al)dhdh..dfNdf, 



(13) 



Rearranging, we get 
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(14) 



where 



i=l 



(15) 



Rewriting W in matrix form: 
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The problem with this integral (J*) is that some of its 



variables (/*, /i, /jv ) occur in the limits of the integrals. A transformation can fix this 
problem by using the substitution (see the preliminary work of [1]): 
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We get 



J * 



OO OO OO OO OO v'^ Dv 
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(27r)^+V,|S|5p(y|X) 
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where 
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where C 



yi 



2/2 



Vn 



and / is the identity matrix (in both cases in the formula 



it is X A^). The integration can then be put in the form: 

1 



A/" (?;, 0, L>"M dv 



(19) 



D\-2\j:\-2p{y\x)(r, 

This integral above is called orthant normal integral where the orthant is defined over 

Vn+2 

< OO. Let us denote this orthant by orth. Con- 
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Vn+1 



> and — OO < 



V2N+2 



sider now the term p{y\X). Integrating (jTj) w.r.t. from — oo to oo, and using the fact that 
Ipif*\X,y,x^)df^ = 1 we get 



Pivlx) = J JfPiMX,x,J)piy\f)p{f\X)dfdf, 
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p{f\X)dfdf, 



(20) 



The integral / ^-^dx in the previous equation is inserted on purpose. It equals 1 so it 



will not alter the formula. The above integral will be the same as f[T9|) except that the limits 



will be over the orthant defined by 



V2 



Vn+1 



> and — oo < 



Vn+2 



V2N+2 



< OO. The reason 



is that the above analysis that led to (ITU]) will apply here except that one of the integral's 
limits is different (the Gaussian integral with variable x that is inserted in fl20|) . here the 
upper limit is oo instead of /*). Let us denote this orthant by orth+. The expression for the 
posterior will then be given as. 



J* = p{y* = Mx,y,x*) 



U,Afiv,0,D-^)dv _ h 
J,,,f,^Af{v,0,D-^)dv h 



(21) 



3.2 Further Reduction: 



The limits for the portion ■ ■ ■ , V2n+2)'^ in Expression fl2T]) are from — oo to oo, so these 

variables can be integrated out. This means that the expression can further be reduced from 
a 2N + 2 dimensional integral to an + 1 dimensional integral. The detailed steps of this 
reduction are given below. 
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where the D matrix defined in ( JTSjl is written as D 



. We can write: 
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and /cs = (27r) 2 \A22\ ^\D\2. Some ma- 
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Now the inside integral w.r.t. v' equals 1. We get 

/i = A;3(27r)^|y4|-^ j M {y,Q, A'^) dv, 



(23) 



v>0 



where 



A = I- A12A22M12 (24) 
A similar formula applies for I2 but with integration limits given by — 00 < fi < 00 and 



V2 



Vn+1 



> 



Some further simphfication (see the Appendix for a detailed proof) leads to 

I + Ai2Y!Ai2 (26) 



A-^ = R 



(25) 



where S' is the composite covariance function (of training patterns plus testing pattern, see 
Eq. dU): 



yT 



The final classification posterior probability is thus given by 

Uh-^i^^0^I + Al2^'Ai2)dv ^ h 
Iorth+^iv,0,I + Al2^'Ai2)dv I2 



J* = p{y* = MX,y,x^) 



(27) 



(28) 



where orth means the integration over f > 0, and orth^ is the integration over the region 



p{y\X), which represents the marginal hkehhood as described in Eq. (fT2|) and beyond, is 
basically /2, or the denominator of the expression for J*. Thus, 



The last identity is obtained by noting that the limit for vi in Eq. f l2^ stretches from —oo 
to oo and therefore vi can be integrated out. The vector v' = {v2, ■ ■ ■ ,vn+i)'^ corresponds 
to the variables of the training set, while vi corresponds to the test pattern. 

3.3 Some Insights 

The denominator, as mentioned, represents the marginal likelihood p{y\X). Essentially, 
the marginal likelihood, as also observed from the integral, measures how well the class 
memberships pi fit with the covariance structure. If the patterns of I's and -I's multiplied 
with the components of E' (through A12) emphasize the large covariance elements (through 
agreeing class memberships, i.e. yiHj = 1) and deemphasizing the small covariance elements 
(through disagreeing class memberships, i.e. the multiplied by the factor yiyj = —1), then 
we have a relatively large marginal likelihood, and therefore a fairly consistent model. The 
reason why the positive orthant integral will have higher value for large positive covariances 
(rather than negative) is that the largest principal components will then be oriented closer 
to the orthant. 

The classification posterior probability J^: is the ratio of the integral over an orthant, 
and the integral is over orth+ which basically extends over two orthants. This makes the 
expression appropriately smaller than 1. Observing the numerator, we find that the predom- 
inant signs (in yi and therefore also ^412) that multiply large covariance elements (in S^x*) 
will determine if the numerator has a large value (therefore the pattern should be classified 
as Class 1), or a small value (therefore the pattern should be classified as Class 2). Note 
that the identity matrix component of the covariance matrix in Eq. ( 126|) is some kind of 
regularizing factor that acts to "fatten" density. 

Let us now consider the effects of the hyperparameters. Consider the RBF kernel (Eq. 
([2])). The analysis here complements and in many ways confirms the insightful analysis of 
Nickisch and Rasmussen [31]. When either the latent function scale f3 — > or the length 
scale a — > then the covariance matrix in the multivariate Gaussian tends to the identity 
matrix. In case of identity matrix the orthant integral equals 2~'^ where d is the dimension. 
So, we end up with L = and J* = |. If /3 — > 00 then the identity matrix part of 
the covariance matrix becomes negligible compared to the other part, see Eq. fl2B]) . and we 
get a formula similar to Eq. fl25]) . but without the identity matrix added in the covariance 
expression. If a — > 00, then we get the following formula. To avoid distraction to side 



given by — 00 < fi < 00 and 



> As we have demonstrated in Eq. (120|) . the term 
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(30) 



issues, the proof is not given here. 
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(31) 



where a is the cumulative Gaussian integral (i.e. the integral of the one- dimensional Gaussian 
density), and A^i (A^2) is the number of Class 1 (Class 2) training patterns. Essentially, this 
gives some kind of a "soft" counting procedure, without regards to the distances involved. 
One can see that all test patterns will be classified as only one specific class (the one that 
wins the counting game). 



4 The Multivariate Gaussian Integral 

As can be seen the final equations f l28|) . f l29|) . and fl30|) for the posterior probability and 
the marginal likelihood are given in terms of two multivariate Gaussian integrals. The 
difficulty is that these are very high dimensional integrals (the dimension equals the size of 
the training set), making that a formidable problem. Consider that we would like to apply 
the standard approach of generating many points according to the multivariate density, and 
then computing the fraction of points that fall in the considered orthant (area of integration). 
Such high order integrals are typically a very small number, for example 10~^°. So even if 
we generate trillions of points, essentially no point will happen fall in the area of integration. 

There is a large literature on the multivariate Gaussian integral. Interest in the problem 
started around the forties of last century, and research is continuing since then, see the 
reviews of Gupta [IT] and [18], Johnson [21], and Genz [H]. Essentially the state of the art 
is that a closed form formula exists only for a dimension up to three for the centered case 
(i.e. the integrals for the zero mean case where the limits are from 0, as is our case, see 
Eriksson [H]), and no closed form formulas exist for the non-centered case. There are some 
special constructions of the covariance matrices for which simplified formulas exist (for any 
arbitrary dimension). There has also been some series expansions for the centered case in 
terms of the elements of the covariance matrix (Kendall [25], and Moran [31]), or in terms 
of the elements of the inverse of the covariance matrix (Ribando [38]). These formulas, 
while very elegant and insightful, are intractable for dimensions larger than ten, because of 
the huge number of combinations of powers of the N"^ variables of the covariance matrix. 
A parallel track in the attempt to tackle the multivariate Gaussian integral is by applying 
some efficient numerical integration techniques (see for example Schervish [12]). Due to the 
exponential nature of these methods, they are applicable for a dimension up to around 20. 
Another track considers improvised Monte Carlo methods (Deak [H], Genz [T3j, Breslaw [5] 
and Hajivassiliou et al [IH]). Again, these methods have not been demonstrated on high 
dimensional problems. 



5 New Monte Carlo Method 
5.1 The Proposed Method 



The proposed new Monte Carlo method tackles the high- dimensional multivariate Gaussian 
integral, and thereby simultaneously evaluates the posterior probability and the marginal 



likelihood. It combines aspects of rejection sampling and bootstrap sampling. The general 
idea is to first generate samples for the first variable vi. Subsequently, we reject the points 
that fall outside the integral limits (for vi). Then we replenish in place of the discarded 
points by sampling with replacement from the existing points (i.e. the points that have 
been accepted). Next, we move on to the second variable, V2, and generate points using 
the conditional distribution p{v2\vi). Again, we reject the points of V2 that fall outside the 
integration limit, and replenish by sampling with replacement. We continue in this manner 
until we reach the final variable vn- The integral value is then estimated as the product of 
the acceptance ratios of the variables. Unlike MCMC-type methods, we do not need to 
perform additional cycles. We cycle only once through the N variables, each time generating 
a number M of points. 

Here are the detailed steps of the algorithm: 

1. For i = 1 to N perform the following: 

2. Ifi = 1 then generate M points f i(m) from p{vi). Otherwise, generate M points Vi{m) 
according to the conditional density function: 

p = pivi\vi.,i^i{m)) (32) 

where fi:j_i(m) is the m*'^ string of points (of variables Vi to fj-i) already generated 
in the previous steps. 

3. Reject the points Vi{m) that are outside the area of integration, i.e. reject the points 
Vi{m) < 0. Assume that there are Mi{i) accepted points and M2{i) = M — Mi{i) 
rejected point. 

4. Replenish in place of the rejected points, by sampling a number M2{i) points by re- 
placement from among the accepted points. 

5. Once reaching the last dimension i = N, we stop, computing the multivariate integral 
as 




5.2 The Rationale of the Algorithm 

The proof that the proposed algorithm leads to an estimate for the multivariate orthant 
probability is essentially by construction, and this is described here. The multivariate integral 
can be written as: 

/ = p(vi >0,V2>0,...,vn>0) (34) 
= p{vN>0\vi>0,...,VN-i>0)...p{v2>0\vi>0)p{vi>0) (35) 

Since we generate points according to the distribution in fl32|) . which conditions only on the 
surviving points that were not being rejected in previous rounds, the generated points will 
obey the distribution p{vi\vi > 0, . . . > 0). As such, the ratio Mi{i)/M is an estimate 
of the probability p{vi > 0\vi > 0, > 0). Using Eq. (!35|) we obtain the product 

formula ( 133|) for the overall multivariate integral. Please note that the bootstrap sampling 
step does not alter the distribution of the generated points. 



5.3 Example: 



Consider for example that we would like to compute 



oo roo POO 



0.7 JlA Jl.2 



(36) 



We generate M points from p{vi) (let M = 10). Let these points be 2.1, 0.2, 0.6, 1.8, 
2.2, 0.8, 1.3, -0.3, 1.4, 1.6. We reject the points 0.2, 0.6, 0.8, -0.3, as they are less than 
1.2 and hence are outside the area of integration. We keep the six accepted points 
2.1, 1.8, 2.2, 1.3, 1.4, 1.6, and remove the rejected points. In place of the rejected 
points, we sample a similar number (i.e. four) from among the accepted points, with 
replacement. Assume we have done that and we have obtained: 2.1, 1.3, 1.8, 2.1. So, 
overall we have the following points (fi(m)'s): 2.1, 1.8, 2.2, 1.3, 1.4, 1.6, 2.1, 1.3, 1.8, 
2.1. 



2. Generate M = 10 points from p{v2\vi{'m)) . Specifically, we generate one point ^2(1) 
using the density p(f2|fi = 2.1), then one point ^2(2) using the density p(f2|fi = 1-8), 
and so on. Let these generated points be V2{m) = 2.5, 1.5, 2.0, 1.7, 1.1, 1.5, 1.7, 1.3, 
1.9, 2.0. We reject the points 1.1 and 1.3 as they are below the integration limit of 1.4. 
Then we sample two more points by replacement in place of these rejected points. We 
continue in this manner for the remaining dimension. Assume that we rejected three 
points in this case. 



3. The integral estimate is the product of the acceptance ratios, i.e. it equals (^j-^ 



5.4 On the Convergence of the Proposed Algorithm 

Because the integral is typically a very small number, we will consider here the logarithm of 
the integral as the target value we would like to estimate, i.e., from Eq. (!35|) : 

log(/) = log[p{vN > 0\vi > 0, . . .,vn-i > 0)] + . . . + log[p(t;2 > 0\vi > 0)] + log[p(i;i > 0) 

(37) 

Assume for the time being that the values generated are independent. In every step of 
the algorithm we generate M Bernoulli trials, where each has a probability of Pi = p{vi > 
0|f 1 > 0, . . . , f j_i > 0) in landing in the integral's sought interval and being accepted for the 
subsequent steps. Let us analyze the bias and the variance. 
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where the expression in the RHS originates from a Taylor series expansion of log(Mj/M) 
around the value log(Pj), and keeping up to quadratic terms. The expectation in the second 
term in the RHS equals zero (because it is a binomial process, so the expectation of Mi 
equals MPi). The last term in the RHS can also be evaluated, and we obtain the bias as 



Bias = E 



log I 
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+ 0{M~ 



(39) 
(40) 



which means that we can have the bias as close as possible to zero, as the number of generated 
points becomes very large. 

Concerning the variance, we get 
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For the overall integral, we get 
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log(/) = Y.^og{P,) 



(41) 
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with the bias and variance becoming 
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Thus, the mean square error (MSE) goes to zero as M — y oo. Note that Pi is the outcome 
of a one-dimensional integral, so it is expected to be in the middle range of (0, 1). 

As a benchmark comparison, consider the basic Monte Carlo integration algorithm, where 
we generate a number of points according to the multivariate Gaussian distribution and 
evaluate the fraction of points falling in the area of integration. In that case the mean 
square error is (1 — I) /{MI) + 0(M~^), where / is the value of the multivariate integral. 
One can see that the MSE is very large because typically / is infinitessimally small. 

When we derived the above formula, we assumed that the samples are independent, as an 
approximation. Strictly speaking they are not, because of the following reason. Consider two 
points Vj{l) and Vj{2) generated according to p{vj\vi > 0, . . . , > 0). Tracking backwards 
from their values at dimension j and going upstream through the conditioned variables, we 
could find one variable, say f that is a common conditioned variable to the two generated 
points Vj{l) and Vj(2) (i.e. a common ancestor). This is because of the bootstrap sampling 
procedure. However, we argue that the dependence will be fairly small. This is because 
equally- valued samples will get completely dispersed when we generate samples for the next 
variable, and so the dependence will decay fast. So, the net effect of this dependence is to 
have a somewhat higher MSE, but it would still be the same order, i.e. 0{M~^), and with 
higher coefficient. (It is akin to estimating the mean of a variable using generated points 
having a banded covariance matrix, the MSE will still be 0(M~^).) 



5.5 On Generating from the Distribution p{vi\vi:i-i{m)) 



In Step 2 in the algorithm described in Subsection 15. H we need to generate from the con- 
ditional Gaussian distribution. This can be accomplished using the well known identity 
(assume mean(t>)=0): 



where 



p{vi\vi;i-i) = J\f(vi,bjvi;i-i,ai^ 

bi = Rl:i-lA:i-lRl:i-l,i 



(45) 
(46) 



— Ri,i — (47) 

and where R = I+Ai2T,'Ai2 is the covariance matrix pertaining to the multivariate Gaussian 
(see Eqs. ( I25l) and ( l28l) ). and the notation Ai-j^k-.i means the matrix constructed from A by 
taking rows i to j and columns j to k. 

We have to compute these variables in Eqs. P6|) and P7|) . including inverting a matrix 
every step, i.e. N times. We present here a computationally more efficient algorithm based 
on a recursive computation of the quantities in Eqs. fH6|) and (jUj). Assume that we have 
performed the computations at Step i, i.e. that bi, af and Qi = Ri-i-n.i-i are available. 
Proceeding to the next step i + 1, we ffist tackle Qi+i- Using the partitioned matrix inversion 
(Horn and Johnson [20j), we get 



/ Rl:i-IA 
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_ ( Rld-l,l-.i-l ~^ \Rld-l,l-.i-lRl-.i-l,i^-.i-l,iRld~l,lA~l ~\Rl-A.-l,l:i-lRl.i-l,i\ 
~ \ luT D-1 1 



k 



where k = Ru — R^i-i {Ria-i i:i-iRiA-i,i- Notice that k equals af, which is available from 
the previous step. Moreover, substituting from Eq. ( H6|) . we get 

g.+i = ^ (49) 

v ^ / 

We also get 

bi+i = Qi+iR\;i,i+i (50) 



cXj_^i — -Rj+ii+i — Ri-iij^ih+i (51) 

In summary, using these recursive formulas we can compute the moments for the conditional 
distribution using 0{N'^) instead of 0{N^) operations, thus providing some computational 
savings. 



5.6 Summary of the Algorithm 

The algorithm turns out to be very simple, and can be coded easily. It is important to 
start with the training set, then proceed with the test set. So, basically we will rename the 
variables, such that fi^Ar represents the training set, and v^+i-.n+ntest represents the test 
set. Also, for convenience, the covariance matrix of (125]) will be rearranged and will be made 
to include all test patterns, to become 




where X^, is the matrix of test patterns. As can be seen in the algorithm, the training set 
computations have to be done once, and need not be repeated for every test pattern, making 
the algorithm of incremental nature. The algorithm is described follows: 



Algorithm GPC-MC 



1. i — 1: Set Q2 = kT!''^! ~ ^lenerate M points vi{m) according to N'{vi,0,al). 
Compute 

4f(vi(m) s. t. Vi(m) > 0) 

A = ^ M — <^3' 

where the latter expression means the fraction of points that are > 0. Remove the 
points Vi{m) < 0. Sample by replacement from among the remaining points to keep 
the total number of points equal M. Rename the variables, so that Vi{m) are the new 
kept points. 

2. For i = 2 to do the following: 

(a) Compute the matrices: 

bi = QiRi:i-i,i (54) 

= R,, - i?^,_i,,6, (55) 

Q^^^ = r (56) 

V ^ / 

(b) Generate M points Vj(m) according to J\f{vi,bjvi;i_i{m),af), m = 1,...,M. 
Compute 

j^(vi{m) s. t. Vi{m) > 0) 
p. = ^ >- (87) 

(c) Remove the points fi(m) < 0. In their place, sample by replacement from among 
the remaining points to keep the total number of points equal M. Rename the 
variables, so that Vi{m) are the new kept points. 

3. The log marginal likelihood function is given by the following sum over the training 
set probabilities: 

N 

LogL = Elog(A) (58) 

i=l 

4. For i = + 1 to + NTEST (the test patterns) do the following: 

(a) Compute the matrices: 

h — QN+lRl:N,i (59) 

where Qn+i represents the covariance matrix inverse, obtained at the last training 
pattern. It will not be be updated further during the test. 

= Ri,i - Rlj,^,bi (60) 

(b) Generate M points Vi(m) according to Af{vi, bJvi:N{m), a^), m = 1, . . . , M. Com- 
pute 

4^(vi(m) s. t. vAm) > 0) 

M ^ ' 

Note that Pi is the sought test pattern posterior probability. Note also that we do 
not need to perform the bootstrap sampling step here for the test. 



6 Simulation Experiments 



6.1 Experiment 1: Testing the New Monte Carlo Integration Method 

In this experiment we test the convergence properties of the new Monte Carlo multivariate 
Gaussian integration approach. The goal here is to test the efficacy of this new method 
irrespective of its use in Gaussian process classifiers. The application of this method to the 
Gaussian process classifier hinges mainly on its success as a stand-alone integration approach. 
Once we establish this fact, we will have assurances that it would work well in the Gaussian 
process classifier setting. 

To be able to judge the new method's approximation error, we have to use examples 
where the "ground truth", i.e. the real integral value, is known. We identified a special 
form where this can be obtained. The experiments will be performed on this special form, 
described below. 

The covariance matrix equals 1 on the diagonal and equals didj off the diagonal (at the 
{i,jy^ position), where d = {di, . . . .d^)'^ is some vector with \di\ < 1. In such a situation, 
the orthant probability can be reduced to a simple one-dimensional integration, as follows: 








where a is the cumulative Gaussian function (i.e. the one- dimensional integration of the 
Gaussian density function). This formula was proposed by Das [8j, Dunnet and Sobel [10], 
and Ihm [21]. It has been also generalized to different forms by Marsaglia [29] and Webster 
[52], and also used in combination of Monte Carlo sampling by Breslaw [S]. Some special 
cases of this formula even yield some closed-form solutions. 

In this experiment we considered different dimensions for our space. Specifically, we 
considered the dimensions = 50, = 200, and = 500. In addition, for each dimension 
we considered 50 different problems, where each problem has a different d vector (whose 
components are generated from a uniform distribution in [—1,1]). To evaluate how the 
approximation error varies with the number of Monte Carlo samples M, we ran each of 
these problems for various values of M. Because the value of the integral is usually an 
infinitessimal value, a sensible approach is to consider the logarithm of the estimated integral, 
and compare it to the logarithm of the true integral. For example a typical integral value 
for a 500-dimensional problem could be 10^^°°. The logarithm becomes about -461. As an 
error measure, we used the following mean absolute percentage error, defined as: 

where / is the true integral value, / is the integral value estimated by the algorithm, and 
NR represents the number of runs (i.e. the number of different d vectors tested, in our case 
NR = 50). 

Note that we have to be careful when evaluating the true integral numerically using Eq. 
( l62l) . If we multiply the terms first and then integrate numerically, we end up with very small 
numbers, leading to a large error. We overcame this difficulty, by successive normalization 
by the maximum value after each multiplication. Then we evaluate the integral and multiply 
back the normalization terms that we divided by. 
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3,000,000 


0.012 ( 0.0011 ) 


0.015 ( 0.0065 ) 


0.043 ( 0.0299 ) 


1,000,000 


0.016 ( 0.0020 ) 


0.019 ( 0.0068 ) 


0.045 ( 0.0298 ) 


300,000 


0.041 ( 0.0050 ) 


0.032 ( 0.0072 ) 


0.050 ( 0.0297 ) 


100,000 


0.072 ( 0.0094 ) 


0.039 ( 0.0070 ) 


0.059 ( 0.0303 ) 


30,000 


0.141 ( 0.0160 ) 


0.078 ( 0.0065 ) 


0.080 ( 0.0289 ) 


10,000 


0.245 ( 0.0268 ) 


0.101 ( 0.0093 ) 


0.107 ( 0.0322 ) 



Table 1: The Mean Absolute Percentage Error (MAPE, in %) of the Log Multivariate 
Gaussian Integral Estimate (and its Standard Error in Brackets) against the Dimension of 
the Problem and the Number of Monte Carlo Samples (the Numbers are in Percent so they 
are Multiplied by 100) 

Table [T] shows the MAPE error measure (average over each of the 50 tested problems) 
for each of the tested values of (dimension) and M (number of Monte Carlo runs). Note 
that these are percent errors, so they are multiplied by 100. Displayed in the table is also the 
standard error (over the 50 tested problems). One can observe that the developed algorithm 
evaluates the orthant probabilities with good accuracy. As expected the accuracy tends to 
improve for larger M. However, the relation between the accuracy and the dimension is less 
straightforward to describe. Even though by Eqs. fH5]) and flH|) one might expect that for 
large N there will be more terms and hence a higher error, in practice the Pj's are more 
important influencing factors. One can also see that the algorithm succeeded for even the 
case of 500 dimensional problems, even though such high dimensions are quite formidable 
problems. Most of the algorithms for orthant probability estimation test on problems with 
only tens of dimension. 

Note that because of memory limitations, in case of a large number M of Monte Carlo 
samples it may not be practical to propagate all samples together. A more practical approach 
is to rerun the problem several times, each with a smaller M. For example assume that we 
would hke to use 2,000,000 samples. In that case we apply ten runs, each with M = 200, 000. 
We then average the integral estimates obtained. 

6.2 Experiment 2: Testing the New Monte Carlo Method on 
Gaussian Process Classification Problems 

The next group of experiments aims to verify that the proposed Monte Carlo method, in a 
Gaussian process classification setting, does converge to the true solution. The problem we 
face is that in general there is no way to know the true solution, and so it could be hard 
to verify this claim. However, we identified a special group of problems where the "ground 
truth" could be obtained. This is if we take the distance kernel function to be of the dot 
product form. This means that the covariance matrix equals 

S = XX^ (64) 

This is the so-called linear kernel. It is a legitimate kernel function as it represents a similarity 
between the patterns, and is positive semidefinite. For single-feature classification problems 
it can be shown after a few of lines of derivation that the class 1 posterior probability of a 



pattern, as given by Eq. fl28l) . becomes 



J ^ e 2 a(a:M) n^=i a{yiX.iu)du ^^^^ 
e""^ n^i cT{yiXiu)du 

where Xi and a; are the feature values of respectively the i*'^ training pattern and the test 
pattern, and yi denotes the class membership for pattern i. The marginal likelihood is simply 
the denominator in Eq. f l65|) . In this group of problems the covariance function becomes 
of the form discussed in the Experiment 1, where we can make use of the one-dimensional 
integration method of Eq, f lB^ to evaluate the integrals. The fact that we are dealing with 
a one-dimensional feature space does not necessarily make the problem any easier. We are 
still dealing with the same formulas and with the same very high-dimensional integrals. The 
dimension of the feature vector impacts only the covariance matrix, it will just have different 
entries. 

We generated a number of training and testing patterns from a one- dimensional (i.e. 
single-feature) two-class Gaussian problem. To ensure that the proposed model can handle 
different types of problems, we considered a variety of training/testing set sizes, a variety 
of means/variances for the class-conditional densities (in order to account for a variety of 
different class overlaps). Specifically, we considered the four problems shown below. Let N 
and NT EST be the sizes of respectively the training set and test set, and let /ij and cTj be 
respectively the mean and standard deviation of the class conditional density for class i. 



• 


Problem 1: 
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100, 


NT EST = 


50, 


/ii = 


= 0, 


1^2 = 


= 1, CTi = 0.2, (72 


= 0.3. 


• 


Problem 2: 


N = 


200, 


NT EST = 


100, 


, /ii 


= 0, 


, 1^2 


= 1, CTi = 2, (72 


= 1. 
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Problem 3: 


N = 


400, 


NT EST = 


200, 


. f^i 


= 0, 


. fJ'2 


= 1.5, (7i = 0.5, 


(72 = 0.75 
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Problem 4: 


N = 


800, 


NT EST = 


400, 


, /ii 


= 0, 


, /i2 


= 1, (7i = 1, (72 


= 0.75. 



When constructing the covariance matrix, we made a point to shuffle the training patterns 
of both classes. This is important for achieving better accuracies/speeds. The reason will be 
mentioned at the end of this subsection. To obtain statistically more reliable numbers, from 
each of the above problems we applied the proposed algorithm a number N^ = 20 different 
times. Since the estimated class 1 posterior probabilities of the different test patterns are in 
a well-known range from to 1, it is sufficient to use an absolute error metric, so we used 
the mean absolute error (MAE), defined as follows: 

]^ NTEST 

MAE = y |J„ -X,| (66) 

NTEST ^ I *Ji ^ ^ 

where J^:j is an evaluation of the class 1 posterior probability for test pattern j using an exact 
numerical integration procedure (the true value), obtained by the formula of Eq. (^B^, and 
J^:j is the estimate using the proposed Monte Carlo procedure. Table [2] shows the obtained 
MAE values, averaged over the 20 runs, for a variety of numbers of Monte Carlo samples. 

Concerning the log marginal likelihood, we evaluated it using the proposed Monte Carlo 
algorithm, and compared it with the true value, obtained numerically by evaluating the 
denominator of Eq. ( I65l) . Since the log marginal likelihood can take any level, a normalized 
error measure is more appropriate. So we used the mean absolute percent error (MAPE) 
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3,000,000 
1,000,000 
300,000 
100,000 
30,000 
10,000 


0.00016 ( 0.00004 ) 
0.00031 ( 0.00007 ) 
0.00056 ( 0.00013 ) 
0.00095 ( 0.00021 ) 
0.00160 ( 0.00036 ) 
0.00308 ( 0.00069 ) 


0.00024 ( 0.00005 ) 
0.00043 ( 0.00010 ) 
0.00081 ( 0.00018 ) 
0.00139 ( 0.00031 ) 
0.00297 ( 0.00066 ) 
0.00463 ( 0.00103 ) 


0.00022 ( 0.00005 ) 
0.00038 ( 0.00009 ) 
0.00062 ( 0.00014 ) 
0.00112 ( 0.00025 ) 
0.00212 ( 0.00047 ) 
0.00391 ( 0.00088 ) 


0.00022 ( 0.00006 ) 
0.00045 ( 0.00012 ) 
0.00075 ( 0.00019 ) 
0.00128 ( 0.00033 ) 
0.00235 ( 0.00061 ) 
0.00443 ( 0.00114 ) 



Table 2: The Mean Absolute Error (MAE) (averaged over the 20 runs) of the Gaussian 
Process Classification of Experiment 2 for the Four Different Problems against the Number 
of Monte Carlo Samples (the Standard Error is in Brackets). 



No MC 
Samples 


Prob 1 
(A^ = 100) 


Prob 2 
(A^ = 200) 


Prob 3 
(A^ = 400) 


Prob 4 
(A^ = 800) 


3,000,000 
1,000,000 
300,000 
100,000 
30,000 
10,000 


0.0081 ( 0.0018 ) 
0.0187 ( 0.0042 ) 
0.0364 ( 0.0081 ) 
0.0671 ( 0.0150 ) 
0.1170 ( 0.0262 ) 
0.1522 ( 0.0340 ) 


0.0088 ( 0.0020 ) 
0.0097 ( 0.0022 ) 
0.0255 ( 0.0057 ) 
0.0340 ( 0.0076 ) 
0.0629 ( 0.0141 ) 
0.1334 ( 0.0298 ) 


0.0063 ( 0.0014 ) 
0.0095 ( 0.0021 ) 
0.0130 ( 0.0029 ) 
0.0238 ( 0.0053 ) 
0.0450 ( 0.0101 ) 
0.0900 ( 0.0201 ) 


0.0033 ( 0.0009 ) 
0.0059 ( 0.0015 ) 
0.0077 ( 0.0020 ) 
0.0130 ( 0.0034 ) 
0.0249 ( 0.0064 ) 
0.0622 ( 0.0161 ) 



Table 3: The Mean Absolute Percent Error MAPE (in %, i.e. the Numbers are Multiplied 
by 100) of the Log Marginal Likelihood of Experiment 2 for the Four Different Problems 
against the Number of Monte Carlo Samples. All are Averages over the 20 Runs, and the 
Standard Error is in Brackets. 

measure. The formula is similar to Eq. fl63|) . but with the appropriate comparison variables 
replaced. Table [3] shows the obtained MAPE (%) values for a variety of numbers of Monte 
Carlo samples for the log marginal likelihood estimation problem. 

As seen from both tables, the algorithm is able to achieve a low error for both, the prob- 
ability evaluation and the marginal likelihood. One can also see that increasing the number 
of Monte Carlo samples M leads to better accuracy. As mentioned in the last experiment, 
the relation between accuracy and dimension is less straightforward to describe. It is in- 
fluenced more by the specific covariance matrix and the resulting conditional probabilities 
Pi. By observing Eqs. (1431) and fl44p . one finds that a small Pj can lead to large error. It 
is therefore advantageous to have the Pj's closer to the middle (in most cases it is around 

0. 5). To achieve that, it is important to shuffle the data of both classes, rather than list first 
the data for Class 1, followed by the data of Class 2. The latter will cause more extreme 
Pj's and therefore lead to less accuracy. Other than random shuffle, one could interleave the 
data of class 1 and class 2 in a repetitive way (e.g. class 1 pattern, then class 2, then class 

1, then class 2, etc). 



6.3 Experiment 3: Comparison between the New Monte Carlo 
Method and the MCMC Approach 

In this and the next experiment we present a comparison of the proposed Monte Carlo 
algorithm with the Markov Chain Monte Carlo (MCMC) approach, its only peer. The 
MCMC is the only available method that can accurately compute the exact classification 
probabilities. All other methods give only approximations. There are several MCMC based 
models. In this experiment we compare between the proposed algorithm and the Hybrid 
Monte Carlo (HMC) |33], and the Elliptical Slice Sampler (ESS) by Murray, Adams, and 
Mackay [32]. For both methods, we use the implementation written by Rasmussen and 
Nickisch [37], which includes several enhancements of these two methods. 

We considered Problem 3 (A^ = 400) of Experiment 2 (with a linear kernel). As men- 
tioned, these are the only type of problems where the ground truth is known. For the 
purpose of comparison the two main aspects of speed and accuracy are important. They 
are contradictory metrics, for example improving the accuracy (by having a larger Monte 
Carlo sample) will lead to more lengthy runs, and vice versa too. To be able to visualize 
simultaneously both of these aspects of the performance we have plotted both the CPU time 
against the logarithm of the MAE in Figure 1 (for the case of probability estimation) and 
against the logarithm of the MAPE in Figure 2 (for the case of marginal likelihood estima- 
tion). In each of the two figures every point corresponds to the average CPU time/average 
MAE (or MAPE) over ten runs for a particular Monte Carlo parameter. The Monte Carlo 
parameter for the proposed algorithm, and for the HMC and ESS algorithms is the number 
of Monte Carlo samples. For the HMC and ESS algorithms we kept the other parameters at 
their recommended values, as given in the implementation by Rasmussen and Nickisch [37] 
(they are any way much less infiuential than the number of samples). The parameters are 
fixed as follows: the number of of skipped samples is 40, the number of burn-in samples is 
10, and the number of runs to remove finite temperature bias is 3. 

To be able to judge the advantage of one algorithm versus another, one should examine 
the difference in accuracy for the same run time, or similarly the difference in run time for 
the same accuracy. One can see from the graphs that the proposed algorithm generally 
beats the HMC algorithm. The margin of outperformance is considerable for the marginal 
likelihood case. While ESS beats the proposed algorithm for pattern probability estimation, 
the proposed algorithm holds the upper hand for the marginal likelihood case. As pointed 
out before, the marginal likelihood is by far the most important of the two aspects. The 
reason is that it is evaluated numerous times in the process of tuning the hyperparameters 
of the kernel function, while the probability estimation is performed only once. For example, 
from the figure the proposed algorithm produces a log(MAPE) of the marginal likelihood 
of about -4.6 with a CPU time of 2150 sec. The ESS algorithm produces about the same 
log(MAPE) with CPU time of 5350 sec. With about a hundred application of an optimization 
algorithm (such as Rasmussen and Nickisch's GPML toolbox's optimize function [37]), the 
new algorithm takes about 60 hours while ESS takes about 148 hours. 

6.4 Experiment 4: Comparison between the New Monte Carlo 
Method and the Control Variables Method 

In this experiment we compare the proposed algorithm with the model developed by Titsias, 
Lawrence and Rattray [17] . This is one of the efficient versions in the literature for applying 
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Figure 1: Log of the Mean Absolute Error (MAE) of the Probabihty Estimates of the New 
Algorithm, the HMC Algorithm, and the ESS Algorithm against the CPU Time (Seconds) 
of the Runs 
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Figure 2: Log of the Mean Absolute Percentage Error (MAPE) of the Log Marginal Likeli- 
hood Estimates of the New Algorithm, the HMC Algorithm, and the ESS Algorithm against 
the CPU Time (Seconds) of the Runs 



MCMC to the Gaussian process problem. It relies on using dynamically optimized control 
variables that provide a low dimensional representation of the function. The code is publicly 
available at [15] . 

For the purpose of this comparison, there is however the problem of lack of ground truth, 
because the algorithm by Titsias et al does not apply to linear kernels. It applies only for 
RBF and ARD kernels [19] . In such case we do not know the true value of the integrals that 
would yield the Gaussian process class probabilities. Nevertheless, we run both competing 
models on a number of problems, and check their convergence properties. If they converge 
to the same class probabilities, then it is a strong indication that both of them do indeed 
converge. Also, we consider the RBF kernel, which is a more efficient and more widely used 
kernel than the linear kernel. So the experiments presented here are more relevant, as they 
fit more closely to the real experimental situations, because of the prevalence of use of the 
RBF kernel. 

We considered artificially generated data using Gaussian class-conditional densities. For 
these problems, using the Bayes classifier's formula, one can compute the true posterior 
probability P{yi = l\x) = p.**""^ which the Gaussian process classifier attempts to model. 
However, we must emphasize that these true posterior probabilities need not be the same 
as those obtained by GPC, as GPC is based on a different formulation. If either proposed 
algorithm or the Titsias et al algorithm does a good job converging to the true value of the 
sought integral, but it turns out to be far from the true posterior, then it is not the fault 
of the algorithm. It should be attributed to the degree of validity of the Gaussian process 
formulation or to the finite sampled-ness of the training data. Nevertheless, a comparison 
with the true posterior provides a useful sanity check. We computed the mean absolute error 
between each competing method's estimated probabilities and P^*''"^ (let us denote them by 
MAE(TITSIAS) and MAE(NEW)). 

Let d be the dimension of the feature vector, and let denote the dimensional vector 
of all ones. Also, let: 

and Sio is a 10 x 10 one-banded matrix with 0.5 on the diagonal and 0.2 on the upper and 
the lower bands. We considered the following problems, that provide a variety of different 
levels of training set sizes, class overlaps, and space dimensions. 

• Problem 1: NTRAIN = 20, NTEST = 50, d = 2, p{x\Ci) = Af{x, 0, /), pix\C2) = 
M{x,e2,T,o). 

• Problem 2: NTRAIN = 50, NTEST = 50, d = 2, p{x\Ci) = M{x, 0, /), p{x\C2) = 
A/'(a;,e2,So). 

• Problem 3: NTRAIN = 200, NTEST = 200, d = 2, p{x\Ci) = Af{x,0,I), 
pix\C2) = A/'(a;,e2,So). 

• Problem 4: NTRAIN = 50, NTEST = 50, d = 2, p{x\Ci) = Af{x, 0, /), p{x\C2) = 
Ar(x, 0.562, So). 

• Problem 5: NTRAIN = 50, NTEST = 50, d = 2, p{x\Ci) = M{x, 0, /), p{x\C2) = 
A/'(x,2e2,So). 



Problem 


MAE(NEW) 


MAE(TITSIAS) 


Problem 1 


0.132 


0.131 


Problem 2 


0.118 


0.119 


Problem 3 


0.077 


0.078 


Problem 4 


0.107 


0.107 


Problem 5 


0.145 


0.144 


Problem 6 


0.264 


0.259 


Problem 7 


0.039 


0.038 


Avg 


0.126 


0.125 



Table 4: The Mean Absolute Error over All Hyperparameter Sets for the TITSIAS Method 
and the Proposed Monte Carlo Method 

• Problem 6: NTRAIN = 50, NTEST = 50, (i = 10, p{x\Ci) = Af{x, 0, /), p{x\C2) = 
A/'(x,eio,Sio). 

• Problem 7: NTRAIN = 1000, NTEST = 1000, d = 2, p{x\Ci) = ^^{x,0,I), 
p{x\C2) = A/'(x,e2,Eo). 

In all problems we assume that the a priori probabilities are equal. 

We ran six different runs on each of these problems. These six runs consider an RBF 
covariance function S, with the following parameters: 



1. 


a 


= 5, 


/3 = 


1 


2. 


a 


= 5, 


/3 = 


5 


3. 


a 


= 3, 


/3 = 


2 


4. 


a 


= 0.5, 


/3 


= 0. 


5. 


a 


= 3, 


/3 = 


1 


6. 


a 


= 0.5, 


/3 


= 3 



For TITSIAS we considered 100,000 iterations, where the results are stored every ten iter- 
ations. Moreover, we considered 5000 iterations for the burn-in, and a starting number of 
three control variables. For each problem we first ran the TITSIAS on the first a and (3 
parameter set. We selected the number of Monte Carlo samples of the proposed method 
so that it runs in about the same time as the TITSIAS (measured by CPU time). Then, 
we fixed this number for the other a and /3 parameter combination runs. Table H] shows 
the average MAE(TITSIAS) and MAE(NEW) for all seven problem (averaged over the six 
hyperparameter combintations). Note that for Problem 7 we excluded Hyperparameter set 
6 due to the failure of TITSIAS to converge in 12 hours of a run. 
From the runs we note the following observations: 

• The runs of both the new method and the TITSIAS method lead to very close prob- 
ability estimates for all the test patterns and all the runs (typically an absolute error 
between TITSIAS probability estimates and the proposed method's estimates of less 
than 10"^). 



• Both methods lead to very similar mean absolute error with respect to the true poste- 
rior error. This error is also fairly low, especially if the size of the training set is large. 
This is an indication that the source of the descrepancy is probably the finite-sampled- 
ness of the training set, rather than an inadequacy of the GPC model. 

• The TITSIAS method was considerably slower for hyperparameter sets 4) and 6), and 
a little slower for hyperparameter sets 3) and 5). Even though we had fixed the number 
of iterations for all hyperparameter sets (at 100,000), it took about ten times as much 
to complete the run for sets 4) and 6) (compared to parameter sets 1) and 2) and 
compared to the proposed method). It seems that lower values of a lead to much 
slower runs for the TITSIAS method. In contrast, the proposed Monte Carlo method 
yields similar speeds for all parameter sets. 

• For both methods the runs take about a few minutes for a small problem (like NT RAIN = 
NT EST = 50), and about ten minutes for a medium problem (like NT RAIN = 
NT EST = 200). Of course, this is with the exception of the hyperparameter com- 
binations that lead to slow runs for the TITSIAS method. For example for Problem 

3 {NT RAIN = NT EST = 200) the proposed method with M = 1, 000, 000 took 12 
minutes using Matlab on a computer featuring an Intel duo core 13 processor. Again, 
it is hard to perform an accurate speed comparison between both algorithms, because 
we do not know the ground truth. 

6.5 Comments on the Results 

Working with general Bayesian methods often leads to high dimensional integrals. Evaluating 
such integrals can sometimes be frustrating because of the high dimensionality, and many 
Monte Carlo approaches fail. The advantage of the proposed appraoch is that it is very 
reliable. It basically works all the time, as we have not encountered a failing run. It is a 
fairly short algorithm, and is simple to code, so this will cut down on development time. The 
fact that it has no tuning parameter (other than the number of Monte Carlo samples) also 
facilitates applying the method and cuts down on time consuming tuning runs. The proposed 
algorithm has two main advantages compared to several of the MCMC algorithms. The first 
one is that it is faster for the problem of evaluating the marginal likelihood. This is important 
because of its repeated evaluation during the hyperparameter tuning step. The second is 
the fact that, unlike MCMC, it has similar speed for different values of hyperparameters. 
This facilitates considerably the hyperparameter tuning process, as we need to probe all the 
space. Having regions in the hyperparameter space where the runs are too slow will hamper 
the entire tuning process. 

Of course, the time consuming part is the hyperparameter tuning part, as we have to 
evaluate the marginal likelihood for many hyperparameter combinations. For such a case, 
it is sensible to use a smaller number M of Monte Carlo samples (for example 20,000 to 
50,000). Exact evaluation of the marginal likelihood will not much impact the optimization 
outcome. It is evidenced by Table |2]and Table Elthat small size Monte Carlo samples achieve 
very low error for the log marginal likelihood (significantly lower than the error in the class 
posterior probabilities). Also, running small samples could possibly yield three or four best 
hyperparameter sets, for which on a closer look we rerun the method using a larger M to 
differentiate between them (a classic exploration versus exploitation problem). Once the 



optimal parameters are obtained, in the classification step we can then use a larger M (for 
example 200,000 or 500,000), because then, accuracy is important. 

An interesting observation is that the impact of the dimension of the integral on the 
accuracy of the proposed method is not that large. It seems other factors weigh in more, 
such as the structure of the covariance matrix, and the resulting conditional probabilities. 
It is imperative to shuffle the class 1 and class 2 patterns, or interleave them in a regular 
way, before constructing the covariance matrix. This will lead to well-behaved conditional 
probabilities, and therefore better accuracy. 

7 Conclusions 

In this paper we derived a new formulation that simplifies the multi-integral formula for the 
Gaussian process classification (GPC) problem. The formulation, given in terms of the ratio 
of two multivariate Gaussian integrals, gives new insights, and potentially opens the door 
for better approximations. 

We also developed a Monte Carlo method for the evaluation of multivariate Gaussian 
integrals. This allows us to obtain very close to exact evaluation of the GPC probabilities 
and the marginal likelihood function. The proposed method is simple, reliable, and fast. 
As such, it should be considered as a promising candidate for researchers to test, when 
attempting to obtain exact GPC probabilities. 

Appendix 

The matrix A22 can be written as 




(68) 



where 




(69) 




(70) 




-1 _ 
22 ~ 



b-Vb-^ 



(71) 



where 



q = a^^+ b^B-^b 



(72) 



Substituting into Eq. ([21]), we get 




(73) 



where 



h' = AuB-^b 



-1 

+c"(/ + s-i)-i 

-1 

+C"(/ + S)-iSx.. 



(74) 
(75) 



(The last equahty follows from substituting for the variable a from Eq. ffTOj) (i.e. a = 
H^^Uxxt), and teleporting the resulting into the bracketed expression (/ + S^^)^^. 
The variable q can be simplified as follows: 



1 

(/ + s-i)-i 



al + l + J {I + S 



(76) 

(77) 
(78) 



The last equation follows from the definition of a (Eq. [TD]), and the definition of al Eq. flTT]) . 
and several steps of simplification. 

The first matrix in the RHS of Eq. (1731) can be simplified further, by noting that 



/ - C'{I + S 



-1^-1^/ 



C 
C'{I 
C\I 



(/ + s 



(j + s-i) -/Ic" 



(79) 
(80) 
(81) 



where we used the fact that C'^ = I because it is a diagonal matrix of I's and -I's. We get 
the final formula for A, as follows: 



A 



C'(J+S)-iSx. 



C"(/ + S)-iC" + 



(82) 



q ^ ' q 

Let US construct the following matrix, which we will subsequently try to invert. 



R 



1 



C'^x.. C\I + ^)C' 
Using the partitioned matrix inverse theory [20j, we get 

q \ < J < q 

which is the same as A in Eq. f l82|) . and that completes the proof. 



(83) 



R- 



C'(/+E)-iSx.,S3^^J/+S)-iC" 



S4) 
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